CONTEXT TREES
Abstract. The goal of this paper is to study the similarity between sequences
using a distance between the context trees associated to the sequences. These 
trees are defined in the framework of Sparse Probabilistic Suffix Trees (SPST), 
and can be estimated using the SPST algorithm. We implement the Phyl-SPST 
package to compute the distance between the sparse context trees estimated with 
the SPST algorithm. The distance takes into account the structure of the trees, 
and indirectly the transition probabilities. We apply this approach to reconstruct 
a phylogenetic tree of protein sequences in the globin family of vertebrates. We 
compare this tree with the one obtained using the well-known PAM distance. 



1. Introduction 

In this work we propose to use the framework of Sparse Probabilistic Suffix Trees
(SPST) to analyze the similarity between sequences and to infer the evolution of
protein families. SPST was first introduced in Leonardi and Galves (2005) as a
generalization of the PST algorithm proposed in Ron et al. (1996). SPST has been
shown to be useful in protein modeling and classification, performing better than
the PST algorithm (Leonardi 2006). The model that inspired the SPST algorithm is
a generalization of Variable Length Markov Chains (VLMC), introduced by Rissanen
(1983), and takes into account the sparseness of the sequences. Given a sequence,
SPST estimates a set of sparse contexts. A sparse context is a short sequence of
subsets of symbols (in a given alphabet) that are relevant to predict any symbol
in the sequence, given that the preceding symbols belong to the subsets of the
context. The SPST algorithm also estimates the transition probabilities associated
with each context. The transition probabilities give the probability of each
symbol conditioned on the fact that the preceding symbols belong to the sparse
context.
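As an illustration of what a transition probability conditioned on a sparse context means, the following minimal sketch estimates it by plain relative frequencies on a toy sequence. This is an assumption made for illustration only: it is not the SPST estimation algorithm, and the function names and the toy sequence are hypothetical.

```python
from collections import Counter

def context_matches(past, context):
    """True if the last len(context) symbols of `past` fall in the subsets.

    `context` is (w_{-k}, ..., w_{-1}); context[-1] constrains the most
    recent symbol of `past`.
    """
    k = len(context)
    if len(past) < k:
        return False
    return all(sym in subset for sym, subset in zip(past[-k:], context))

def empirical_transitions(sequence, context):
    """Relative frequency of each symbol that follows a match of `context`.

    Illustrative counting estimator only, not the SPST algorithm.
    """
    counts = Counter(
        sequence[n]
        for n in range(len(context), len(sequence))
        if context_matches(sequence[:n], context)
    )
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()} if total else {}

# Hypothetical toy data over A = {a, b, c, d}: two steps back the symbol must
# be in {a, b, c}, one step back it must be in {a, c}.
context = (frozenset("abc"), frozenset("ac"))
probs = empirical_transitions("abacadbcab", context)
```

The returned dictionary maps each symbol observed after a match of the context to its relative frequency, so its values sum to one whenever the context occurs at least once.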

An interesting property of the set of sparse contexts is that it induces a partition 
of the set of all possible sequences and can be represented as a tree. We use this 
partition property to define a distance between context trees. This distance can be 
used to measure the similarity between protein sequences. 

To our knowledge, no method has yet been proposed in the literature for sequence
comparison using the information contained in the architecture of the context
trees associated with the sequences. The most closely related approaches proposed
to date are those that model the sequences as first order Markov chains and use a
statistical measure to infer the similarity between them (Wu et al. 2001; Pham and
Zuegg 2004). The most remarkable difference between these approaches and ours is
that we do not directly use the estimated probabilities of the model. Instead,
we use the context tree architecture, which is trivial in first order Markov chains.
We show here that the context tree architecture can carry important structural
information that may be useful to measure the similarity between sequences.

The paper is organized as follows. In Section 2 we review some definitions in the 
framework of SPST. In Section 3 we introduce the distance between sparse trees. In 
Section 4 we present the results obtained for the globin protein family of vertebrates 
and finally in Section 5 we discuss some aspects of our method. 

2. Sparse Context Trees 

Let $A$ be a finite alphabet (for example, the set of twenty amino acids) of size $|A|$.
We will denote by $\mathcal{P}_A$ the set of parts of $A$. That is,

$$\mathcal{P}_A = \{ v : v \subseteq A \}.$$

The elements in $\mathcal{P}_A^j$ will be denoted by $w = (w_{-j}, \dots, w_{-1})$. On the
other hand, we will denote by $\mathcal{P}_A^*$ the set of all finite sequences of elements
in $\mathcal{P}_A$; that is,

$$\mathcal{P}_A^* = \bigcup_{j=1}^{\infty} \mathcal{P}_A^j.$$

Definition 2.1. Let $(X_t)_{t \in \mathbb{N}}$ be a stochastic process taking values on the
finite alphabet $A$. We will say that the process $(X_t)_{t \in \mathbb{N}}$ is a sparse
stochastic chain if there exists a set $\tau \subset \mathcal{P}_A^*$ such that:

(1) For any sequence $x_0, \dots, x_n$ satisfying

$$\mathbb{P}[X_0 = x_0, \dots, X_{n-1} = x_{n-1}] > 0,$$

there exists an element $(w_{-k}, \dots, w_{-1}) \in \tau$ such that

$$\mathbb{P}[X_n = x_n \mid X_{n-1} = x_{n-1}, \dots, X_0 = x_0] =
\mathbb{P}[X_n = x_n \mid X_{n-1} \in w_{-1}, \dots, X_{n-k} \in w_{-k}]. \tag{2.2}$$

(2) If $(w_{-k}, \dots, w_{-1})$ and $(\tilde{w}_{-\tilde{k}}, \dots, \tilde{w}_{-1})$ belong to
$\tau$ and there exists $j$ such that $w_{-i} \cap \tilde{w}_{-i} \neq \emptyset$ for
$i = 1, \dots, j$, then $w_{-i} = \tilde{w}_{-i}$ for $i = 1, \dots, j$.

Figure 1. Examples of sparse trees over the alphabet A = {a, b, c, d}.
(a) The index of the variables grows in the direction from the
leaves to the root. In this case, the set of sparse contexts is
{({a,b,c},{a,c}), ({d},{a,c}), ({b,d})}. (c) Maximum between the
trees in (a) and (b).

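(3) The set $\tau$ is the minimum that satisfies (1) and (2). That is, if $\tilde{\tau}$
satisfies (1) and (2) then, for any $(\tilde{w}_{-\tilde{k}}, \dots, \tilde{w}_{-1}) \in \tilde{\tau}$
there exists $(w_{-k}, \dots, w_{-1}) \in \tau$ such that $\tilde{k} \geq k$ and
$\tilde{w}_{-j} \subseteq w_{-j}$ for all $j = 1, \dots, k$.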

Each sequence $(w_{-k}, \dots, w_{-1}) \in \tau$ is called a sparse context and the set
$\tau$ is called a sparse context tree. This name is justified because the set of
sparse contexts can be represented as a rooted tree. In this tree, each context
$w = (w_{-k}, \dots, w_{-1})$ is represented by a complete branch, in which the first
node on top is $w_{-1}$ and so on until the last element $w_{-k}$, which is represented
by the terminal node of the branch (Fig. 1).
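The tree representation described above can be sketched as a nested dictionary in which the node directly under the root is $w_{-1}$ and each branch ends at $w_{-k}$. The sketch below uses the contexts listed in the caption of Figure 1(a). The function names and the dictionary layout are assumptions chosen for illustration and do not describe the internal data structure of the SPST implementation.

```python
def build_tree(contexts):
    """Nest contexts into a dict: root -> w_{-1} -> w_{-2} -> ... -> leaf."""
    root = {}
    for context in contexts:
        node = root
        for subset in reversed(context):   # w_{-1} sits directly under the root
            node = node.setdefault(subset, {})
    return root

def find_context(root, past):
    """Walk down from the root, reading `past` from its most recent symbol.

    Returns the matching context (w_{-k}, ..., w_{-1}), or None when `past`
    is too short or no branch matches.
    """
    node, branch = root, []
    for symbol in reversed(past):
        if not node:                       # terminal node reached
            break
        child = next((w for w in node if symbol in w), None)
        if child is None:
            return None
        branch.append(child)
        node = node[child]
    return tuple(reversed(branch)) if not node else None

# Sparse contexts of Figure 1(a), written as (w_{-k}, ..., w_{-1}).
tree_a = [
    (frozenset("abc"), frozenset("ac")),
    (frozenset("d"), frozenset("ac")),
    (frozenset("bd"),),
]
root = build_tree(tree_a)

# Every past of length 2 over A = {a, b, c, d} should reach exactly one leaf.
alphabet = "abcd"
covered = all(find_context(root, x + y) is not None
              for x in alphabet for y in alphabet)
```

Because the subsets under each node are disjoint, at most one child can match a symbol, so the walk from the root is deterministic.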

An algorithm was recently proposed to estimate the set of sparse contexts and
the transition probabilities given by (2.2) (Leonardi and Galves 2005; Leonardi 2006).
This algorithm internally represents the set of sparse contexts as a tree, as
described above. We believe that this tree contains important structural information
that can be used to measure the similarity between sequences. Our goal in this
paper is to show some results concerning this conjecture. With this aim we propose
to use a distance between sparse context trees to measure the relatedness between
symbolic sequences. This distance is defined in the next section.



3. A Metric Space of Sparse Trees

Given a sparse context $w = (w_{-k}, \dots, w_{-1})$ we denote by $l(w)$ its length, that
is, $l(w) = k$. We use the notation $s(w)$ for the product of the cardinals of the
$w_{-i}$'s, that is,

$$s(w) = \prod_{i=1}^{l(w)} |w_{-i}|,$$

where $|w_{-i}|$ is the number of symbols in $w_{-i}$.

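Given two sparse contexts $w = (w_{-k}, \dots, w_{-1})$ and
$\tilde{w} = (\tilde{w}_{-\tilde{k}}, \dots, \tilde{w}_{-1})$ we define the intersection between
$w$ and $\tilde{w}$ (assuming without loss of generality that $k \geq \tilde{k}$) by

$$w \cap \tilde{w} = (w_{-k}, \dots, w_{-(\tilde{k}+1)}, w_{-\tilde{k}} \cap \tilde{w}_{-\tilde{k}}, \dots, w_{-1} \cap \tilde{w}_{-1}),$$

if $w_{-i} \cap \tilde{w}_{-i} \neq \emptyset$ for all $i = 1, \dots, \tilde{k}$. In the case
$w_{-i} \cap \tilde{w}_{-i} = \emptyset$ for some $i = 1, \dots, \tilde{k}$ we define
$w \cap \tilde{w} = \emptyset$.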

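Given two sparse trees $\tau = \{w^1, \dots, w^n\}$ and
$\tilde{\tau} = \{\tilde{w}^1, \dots, \tilde{w}^m\}$, we define the maximum between $\tau$ and
$\tilde{\tau}$ by

$$\tau \vee \tilde{\tau} = \{ w^i \cap \tilde{w}^j \mid w^i \cap \tilde{w}^j \neq \emptyset;\ i = 1, \dots, n;\ j = 1, \dots, m \}.$$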
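The intersection of contexts and the maximum of two trees can be sketched in a few lines. The first tree below is the one listed in the caption of Figure 1(a). The second tree is a hypothetical example introduced only for this illustration, since the contexts of Figure 1(b) are not listed in the text. Function names are likewise assumptions.

```python
def intersect(w, v):
    """Intersection of two sparse contexts, or None when it is empty.

    Contexts are tuples (w_{-k}, ..., w_{-1}) of frozensets. The longer
    context keeps its oldest positions unchanged; the positions shared with
    the shorter context are intersected position by position.
    """
    if len(w) < len(v):
        w, v = v, w                        # make w the longer context
    extra = len(w) - len(v)
    tail = tuple(a & b for a, b in zip(w[extra:], v))
    if any(not subset for subset in tail):
        return None
    return w[:extra] + tail

def maximum(tree1, tree2):
    """All non-empty pairwise intersections of the contexts of two trees."""
    result = set()
    for w in tree1:
        for v in tree2:
            common = intersect(w, v)
            if common is not None:
                result.add(common)
    return result

# Contexts of Figure 1(a).
tree_a = [
    (frozenset("abc"), frozenset("ac")),
    (frozenset("d"), frozenset("ac")),
    (frozenset("bd"),),
]
# Hypothetical second tree (NOT the tree of Figure 1(b)).
tree_b = [(frozenset("ab"),), (frozenset("cd"),)]

tree_max = maximum(tree_a, tree_b)
```

For these two inputs the maximum refines both trees: each of its contexts is contained in exactly one context of `tree_a` and one of `tree_b`.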

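The maximum between the trees of Figure 1(a)-(b) can be seen in Figure 1(c).
Before defining the distance between sparse context trees we introduce the notion
of $\beta$-entropy of a tree $\tau$. Following Simovici and Szymon (2006) we define, for all
$\beta > 0$,

$$H_\beta(\tau) = \frac{1}{1 - 2^{1-\beta}} \Big[ 1 - \sum_{w \in \tau} \big( s(w)\, |A|^{-l(w)} \big)^{\beta} \Big], \quad \text{if } \beta \neq 1,$$

and

$$H_\beta(\tau) = - \sum_{w \in \tau} s(w)\, |A|^{-l(w)} \cdot \log_2 \big[ s(w)\, |A|^{-l(w)} \big], \quad \text{if } \beta = 1.$$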

Then, given two sparse trees, $\tau$ and $\tilde{\tau}$, we define the $\beta$-distance between them as

$$d_\beta(\tau, \tilde{\tau}) = 2 H_\beta(\tau \vee \tilde{\tau}) - H_\beta(\tau) - H_\beta(\tilde{\tau}). \tag{3.1}$$

It can be seen that $d_\beta(\cdot, \cdot)$ defines a distance over the set of all context
trees. The proof of this assertion can be found in Simovici and Szymon (2006).
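A minimal sketch of the distance (3.1) follows. It is not the Phyl-SPST code. Each context $w$ is given the weight $s(w)|A|^{-l(w)}$. For $\beta \neq 1$ the sketch assumes the generalized entropy with normalizing constant $1/(1 - 2^{1-\beta})$; this form is an assumption of the example. The second tree is hypothetical, and all function names are illustrative.

```python
import math

def weight(w, alphabet_size):
    """s(w) * |A|^(-l(w)): the share of all pasts covered by context w."""
    s = math.prod(len(subset) for subset in w)
    return s / alphabet_size ** len(w)

def beta_entropy(tree, alphabet_size, beta=1.0):
    """Beta-entropy of a tree; the beta != 1 branch is an assumed form."""
    weights = [weight(w, alphabet_size) for w in tree]
    if beta == 1.0:
        return -sum(p * math.log2(p) for p in weights)
    return (1.0 - sum(p ** beta for p in weights)) / (1.0 - 2.0 ** (1.0 - beta))

def intersect(w, v):
    """Position-wise intersection of two contexts, None when empty."""
    if len(w) < len(v):
        w, v = v, w
    extra = len(w) - len(v)
    tail = tuple(a & b for a, b in zip(w[extra:], v))
    return w[:extra] + tail if all(tail) else None

def maximum(tree1, tree2):
    """Non-empty pairwise intersections of the contexts of two trees."""
    pairs = (intersect(w, v) for w in tree1 for v in tree2)
    return {c for c in pairs if c is not None}

def beta_distance(tree1, tree2, alphabet_size, beta=1.0):
    """d = 2 H(max(tree1, tree2)) - H(tree1) - H(tree2), as in (3.1)."""
    joint = maximum(tree1, tree2)
    return (2.0 * beta_entropy(joint, alphabet_size, beta)
            - beta_entropy(tree1, alphabet_size, beta)
            - beta_entropy(tree2, alphabet_size, beta))

# Tree of Figure 1(a) and a hypothetical second tree over A = {a, b, c, d}.
tree_a = [
    (frozenset("abc"), frozenset("ac")),
    (frozenset("d"), frozenset("ac")),
    (frozenset("bd"),),
]
tree_b = [(frozenset("ab"),), (frozenset("cd"),)]

d_ab = beta_distance(tree_a, tree_b, alphabet_size=4)
d_aa = beta_distance(tree_a, tree_a, alphabet_size=4)
```

In this toy case the distance of a tree to itself is zero and the distance is symmetric in its arguments, as expected from a metric.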

Figure 2. Comparison of the SPST and PAM distance matrices.



4. Results 

We implemented an algorithm coded in C, called Phyl-SPST, to calculate distances
between context trees, as defined by (3.1). The source code and compiled
versions for Mac OS X, Linux/Unix and Windows can be downloaded from the site
http://www.ime.usp.br/~numec/softwares/phyl-spst/.

We applied the Phyl-SPST package to study the similarity between the protein
sequences of the globin family of vertebrates. The 41 sequences used in this analysis
were obtained from the SCOP database (Andreeva et al. 2004) and can be found in
the supplementary material. The program estimated a sparse context tree for each
sequence in this set. Then it computed the distance matrix using the $\beta$-distance
defined by (3.1). In what follows we call this distance the SPST distance. In order to
compare our method with an alignment-based distance we used the structure-based
alignment of the 41 globin sequences of vertebrates present in the PALI database
(Gowri et al. 2003) (alignment available in the supplementary material). Then, we
applied the PROTDIST algorithm of the Phylip3.65 package (Felsenstein 2004)
with the Dayhoff PAM matrix option to compute the distance matrix.

Figure 3. Phylogenetic trees made with the Neighbor Joining clustering
algorithm on SPST distances (a) and on PAM distances (b).

When the PAM and SPST distances are plotted against each other (Fig. 2), a
nonlinear relation is clearly observed. With each distance matrix we reconstructed
a phylogenetic tree using the NEIGHBOR and DRAWGRAM algorithms of the
Phylip3.65 package. These phylogenetic trees can be seen in Figure 3. In both trees
the lamprey globin was used as outgroup.
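To pass a distance matrix to a program such as NEIGHBOR, it has to be written in PHYLIP's square distance-matrix layout: a first line with the number of taxa, then one row per taxon starting with a name padded to ten characters. The sketch below produces that layout. The function name, the taxon labels and the numerical values are placeholders made up for illustration and are not results from this study.

```python
def phylip_distance_matrix(names, matrix):
    """Format a square distance matrix in PHYLIP's square layout.

    Names are truncated or padded to ten characters, as the layout expects.
    """
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)
        lines.append(label + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"

# Placeholder labels and distances (hypothetical, for illustration only).
names = ["lamprey", "myoglobin", "hemoglobin_beta"]
matrix = [
    [0.0, 1.2, 1.5],
    [1.2, 0.0, 0.7],
    [1.5, 0.7, 0.0],
]
text = phylip_distance_matrix(names, matrix)
```

The resulting text can be saved to a file and given to the tree-building program as its input matrix.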

5. Discussion

The dataset we used to verify the potential use of the SPST distances in
phylogenetic reconstruction is a vertebrate subset of the globin gene family. This
family is one of the first protein families to be characterized (Dayhoff 1972) and is,
perhaps, the best known to date (Vinogradov et al. 2006). Besides, the vertebrate
phylogeny is also well studied and is grounded in relatively abundant paleontological,
morphological, molecular, and physiological analyses (Cotton and Page 2002).

The phylogenetic tree shown in Fig. 3(a) indicates that the context trees
inferred from symbolic sequences (in this case, protein sequences) can offer
important evolutionary information about the sequences. This constitutes an original
and very promising aspect of the modeling of sequences by variable memory
stochastic processes, and it needs to be studied in more detail.

The phylogenetic analysis performed here also reflects the overall behavior of the
SPST distance. The tree produced with the SPST distance presents longer branches
in the most inclusive sequences, and shorter branches in the most basal sequences.
With respect to the tree topology, the main difference between the two trees is the
placement of the myoglobin cluster, which is closer to the beta chain of hemoglobin
in the SPST tree, whereas in the PAM tree it lies outside the hemoglobin chains.
Another remarkable difference is the placement of the red tail deer (Odocoileus
virginianus) outside the cluster that contains the mammals, a reptile (Geochelone
gigantea), and a bird (Gallus gallus) in the beta chain cluster of the SPST tree.
Although there are minor misplacements in the tree based on PAM distances with
respect to the traditional vertebrate and globin phylogenies, it reconstructs the
phylogeny better than the tree based on SPST distances.

The relationship between the SPST distance and the classical PAM distance for the
globin family of vertebrates shows a plateau behavior. Short PAM distances
yield larger SPST distances, and the opposite occurs when distances are longer.
This may be caused by the bounded nature of the context trees and by the specific
form of the distance we propose. This analysis therefore shows that small differences
in sequences cause enough changes in the context trees to increase the SPST
distance between them. The characterization of the changes produced in the context
trees by stationary modifications of the sequences, such as mutations, insertions or
deletions, remains an open problem. We think that these characterizations could
help to improve the results shown here. On the other hand, it is also important to
define and test other distances over the set of trees to study their specific behaviors
and compare them to the one proposed here.